Identification of patient subtypes based on protein expression for prediction of heart failure after myocardial infarction

Summary
This study investigates the ability of a high-throughput aptamer-based platform to identify circulating biomarkers able to predict the occurrence of heart failure (HF) in blood samples collected during hospitalization of patients suffering from a first myocardial infarction (MI). The REVE-1 (derivation) and REVE-2 (validation) cohorts included 254 and 238 patients, followed up for 9.2 ± 4.8 and 7.6 ± 3.0 years, respectively. A blood sample collected during hospitalization was used to quantify 4,668 proteins. Fifty proteins were significantly associated with the long-term occurrence of HF, with all-cause death as the competing event. k-means, an unsupervised clustering method, identified two groups of patients based on the expression levels of these 50 proteins. Group 2 was significantly associated with a higher risk of HF in both cohorts. These results show that a subset of 50 selected proteins quantified during hospitalization of MI patients can stratify patients and predict the long-term occurrence of HF.


INTRODUCTION
Despite significant therapeutic improvements during the last decades, the long-term risk of heart failure (HF) after myocardial infarction (MI) remains substantial, and ischemic HF is a major cause of mortality worldwide. 1 Post-MI HF is too often diagnosed at a late stage, when its irreversible consequences are already established. The estimation of the risk of HF in the early post-MI period currently relies on clinical variables, left ventricular function parameters, and conventional cardiac biomarkers such as troponin or B-type natriuretic peptide (BNP). However, other cardiovascular biomarkers might reflect the activation of new potential pathways after MI and contribute to HF in both the short and long term. 2 Broader approaches are important for clinicians to understand the implications of different pathobiological axes. Despite the proliferation of candidate biomarkers, there are limited data comprehensively comparing the prognostic value of biomarkers when assessed in large arrays.
Exploratory analyses using large-scale protein measurement methods allow the simultaneous analysis of large biomarker panels. With the SOMAscan (Slow Off-rate Modified Aptamers) assay, over 5,000 proteins can be measured, covering 8 logs of abundance in the human proteome. These proteins are not targeted toward any particular disease; such wide panels can therefore help to discover new biomarkers, provided that appropriate statistical methods are used to analyze the data. Very recently, aptamer assays were shown to provide excellent precision, an unprecedented coverage, and promise for disease associations. 3 Using SOMAscan profiling (4,453 targets), Gui et al. showed that a plasma multiprotein score improved risk stratification in patients with HF and reduced ejection fraction and identified novel candidates. 4 The aim of this work was to use circulating plasma protein expression levels, quantified during the hospitalization of patients suffering from MI, to identify groups of patients able to predict the long-term risk of occurrence of HF after MI. The prospective REVE-1 (REmodelage VEntriculaire) cohort 5 and REVE-2 cohort 6, in which patients with a first MI were included and underwent long-term follow-up 7, were used as derivation and validation cohorts, respectively. A large panel of 4,668 proteins was measured with the SOMAscan assay in a blood sample collected during the initial hospitalization. To build groups of patients able to predict the occurrence of HF, the analysis was divided into three steps (Figure 1). First, a protein selection step was performed to focus on the relevant proteins in the derivation cohort, REVE-1. Second, the selected proteins were used to build groups of patients on REVE-1 using k-means, an unsupervised clustering algorithm. Patients of REVE-2, the validation cohort, were then assigned to the groups built on REVE-1. Third, the groups' predictive ability for occurrence of hospitalization for HF was assessed using competing risk models in both cohorts. In addition, Ingenuity Pathway Analysis (IPA) and Gene Ontology (GO) analysis were performed to define molecular networks enriched from the selected proteins.
Evidence before this study
HF following an MI is too often diagnosed late, when its irreversible consequences are already established. The estimation of the risk of HF in the early post-MI period currently relies on clinical variables, left ventricular function parameters, and conventional cardiac biomarkers such as troponin or BNP. Despite the proliferation of candidate biomarkers, there are limited data comprehensively comparing the prognostic value of biomarkers.

Added value of this study
We performed a discovery proteomics approach by quantifying 4,668 proteins in two cohorts of patients with a first MI, REVE-1 (derivation cohort) and REVE-2 (validation cohort). A total of 50 proteins were found to be significantly associated with the occurrence of hospitalization for HF. An unsupervised clustering method identified 2 groups of patients based on the expression levels of the 50 proteins in the REVE-1 cohort, and these groups were validated in the REVE-2 cohort. Differences in protein expression led to the identification of key physiopathological processes, combining differences in molecules and leading to two groups of patients at low (group 1) and high (group 2) risk of long-term adverse cardiac outcomes following MI.

Figure 1. Overview of the study
The REVE-1 study 5 was used as derivation cohort and the REVE-2 study 6 as validation cohort. Proteomic data analysis was performed by SOMAscan assay (version V4.0) on all plasma samples collected during hospitalization of patients from both cohorts. The SOMAscan platform accurately measured 4,668 plasma proteins. Log2 transformed data were centered and reduced. Standardization parameters were calculated with the data from the REVE-1 cohort (see Table S2) and then applied to both REVE-1 and REVE-2 cohorts. Univariate competing risk models were fitted for each protein, and significance tests selected 50 proteins as associated with occurrence of hospitalization for HF (p < 1.07 × 10⁻⁵, in accordance with Bonferroni's method). Clustering was then performed on the REVE-1 patients with the 50 selected proteins, by the k-means procedure with k = 2 groups. Patients of REVE-2 were then assigned to one of the groups identified on REVE-1. A competing risk model was then developed based on the group information on both cohorts.

Implications of the available evidence
Based on the expression of the 50 proteins, group 2 of patients was associated with a high risk of occurrence of HF. Stratification of MI patients based on proteins involved in cell-to-cell communication should be important in the future.

Study populations
Patients in both cohorts had similar age and gender; indicators of MI size (wall motion score index and peak creatine kinase) and Killip class were also similar (Table 1). Patients in the REVE-2 cohort were more often treated by primary PCI and less often by thrombolysis. Statistical differences were observed between both cohorts for diastolic blood pressure, heart rate, end-systolic volume, end-diastolic volume and wall motion systolic index. The mean follow-up was 9.2 ± 4.8 years in REVE-1 and 7.6 ± 3.0 years in REVE-2. One patient was lost during the follow-up, leading to 254 patients in REVE-1 for the analysis; no patients in REVE-2 were lost during the follow-up. The numbers of patients who reached the primary endpoint (hospitalization for HF) during follow-up were 49 in REVE-1 and 28 in REVE-2. The numbers of patients who reached the competing event (death from all causes) during the long-term follow-up were 63 in REVE-1 and 26 in REVE-2.

Selection of proteins
For each protein, the standardization parameters (mean and standard deviation, SD) were calculated from the data of the REVE-1 cohort (Table S2). The proteomic data of REVE-2 were then standardized protein by protein for each patient using these values, following the calculation: ((value of the protein) − (mean of the protein in REVE-1)) / (corresponding SD in REVE-1).
After standardization, 50 proteins were selected to be significantly associated with the outcome event in REVE-1 (listed in Table 2 with their significance levels and subhazard ratios and in Table S3 for their location and type). Among the 50 proteins selected on REVE-1, 44 proteins on REVE-2 were regulated in the same manner for the outcome of REVE-2 patients and 6 were conversely but not significantly regulated (PDE4A, SAMHD1, RFNG, ST3GAL5, NMS and DPP4) ( Table 2). The correlation heatmap among the 50 selected proteins in REVE-1 shows that some of the proteins are highly correlated, with the highest correlation of 0.893 between B2M and CST3 ( Figure S2). This shows that several proteins carry a close tendency, which will be taken into account by the following clustering approach.

Patients groups
The silhouette criterion led us to use the k-means procedure with k = 2 groups on REVE-1 (not shown). Patients were split into 160 patients in group 1 and 94 in group 2, and the two groups showed contrasting protein expression profiles (Figure 2A). These contrasting protein expression profiles were also found in the REVE-2 cohort. Patients were mapped to one of the two groups identified on REVE-1, with 190 patients assigned to group 1 and 48 to group 2, and similar characteristics were observed, with opposite protein expression profiles between the two groups (Figure 2B).
We also observed significant clinical differences between the patients of the two groups in both cohorts. In REVE-1, the differences were significant for history of hypertension, diabetes, initial reperfusion therapy, multi-vessel coronary artery disease (CAD), final thrombolysis in myocardial infarction (TIMI) grade 3 flow in the infarct-related vessel, heart rate, Killip class over 2, end-systolic volume, left ventricular ejection fraction, WMSI and treatments at discharge (aldosterone antagonists). In REVE-2, gender, smoking and treatments at discharge (β-blockers) were significantly different between the identified groups. Significant differences in age were found in both cohorts. Thus, protein-based clustering leads to the identification of significant differences in the clinical characteristics of patients (Table 3). The differential expression analysis performed between the identified groups showed that 43 and 41 of the 50 selected proteins had significantly different means between the two groups in REVE-1 and REVE-2, respectively (Table S4). These results highlight that the groups identified in REVE-1 have significantly distinct proteomic expression profiles that were validated in REVE-2.

Event prediction
Cumulative incidence curves in REVE-1 (Figure 3A) and REVE-2 (Figure 3B) showed that, in both cohorts, the identified groups had distinct incidences, with more hospitalizations for HF in group 2. We validated the performance of group 2 for the significantly higher risk of hospitalization for HF with an SHR of 7 (Table S5). A sensitivity analysis was performed by clustering on the 50 selected proteins without BNP, an established biomarker of HF. When the clustering was performed on 49 proteins, only 2 patients from REVE-1 and 1 patient from REVE-2 changed groups. The cumulative incidence curves of these groups were very similar to those obtained with the 50 proteins (Figure S3).

Signaling pathway analysis
To gain further insight into the potential mechanisms in worse outcome of post-MI patients, the 50 proteins significantly associated with occurrence of HF (  Figure 4).
We also examined the significant relationship with the biological pathways involved in ''Function and Diseases pathways'' (ranked below 500 in IPA) related to ''cardiovascular disease'' and ''cardiovascular system development and function'' (Table S7). The key proteins were in the high-scoring networks as well as in networks 8 and 9, with NPPB and DPP4 (dipeptidylpeptidase 4) (network 1), PDE4A (phosphodiesterase 4A) and ADIPOQ (adiponectin) (network 2), CST3 (Cystatin-C) (network 8) and SERPINC1 (antithrombin-III) (network 9) present in most of the cardiovascular diseases selected. All these proteins were associated with hypertension, for which we found significant differences between the two groups of variables identified in both cohorts (Table 3).

DISCUSSION
This study investigated the ability of a large set of biomarkers to identify post-MI patients with long-term occurrence of hospitalization for HF. Our results showed that a subset of 50 plasma proteins measured at time of hospitalization allowed identifying two groups of patients with distinct proteomic expression

The networks were identified using IPA computational algorithms consisting of the 50 selected proteins (Table 2) and their direct interactions with other proteins (''interconnecting'') in the knowledge base. Scores were calculated for each network according to its fit to the set of selected proteins and used to rank networks on the Ingenuity analysis (version 73,620,684 Release: 2022-03-12). The scores take into account the number of selected proteins and the size of the network to approximate the relevance of the network to the original list of selected proteins. Selected molecules are in bold font (see Table 2) and interconnecting proteins identified in the network are in normal font. See also Tables S6 and S7.

Protein signature for high risk of adverse outcome in patients after MI
The relationships between individual biomarkers and adverse outcome following MI have been previously reported. Increased levels of BNP, cardiac troponin, and C-reactive protein were associated with major cardiac events after MI. [8][9][10] In more recent studies, matrix metalloproteinases and other biomarkers of extracellular matrix turnover have also been shown to predict outcome in this setting. 11,12 For decades, risk prediction in clinical practice has been based on generally available clinical characteristics and conventional cardiac biomarkers such as BNP.
The present study is one of the first to assess the relationship between a large set of biomarkers (>4,500 proteins) and long-term clinical outcome using high-throughput technology combined with state-of-art statistical and clustering analyses. We used two prospective cohorts, REVE-1 and REVE-2, which included patients suffering from a first acute MI with blood sampling during hospitalization. Patients underwent long-term follow-up with nearly no loss of patients. Thanks to the SOMAscan assay (version 4.0), 4,668 proteins were measured in the plasma of all the patients included in both cohorts. These proteins were not targeted toward any particular disease allowing us to study the link between the event and the proteins without preconceived idea of which proteins should be investigated thus leading to the potential discovery of new biomarkers.
The high-dimensional data generated by these two studies, with more variables than individuals, makes standard statistical analysis unusable. To minimize the risk of overfitting, state of the art statistical approaches with rigorous stability selection procedure were used to ensure reliability of our finding. First, we selected 50 proteins over the large panels of proteins that were significantly associated with the occurrence of HF. Second, a clustering algorithm was used on these 50 proteins giving the same weight to all of them, resulting in two groups where the information of all the proteins is equally represented. Third, competing risk models were then used to model the occurrence of hospitalization for HF against death for all causes using the group variable: low and high risk of occurrence of HF. To validate these effects of groups, we used the validation cohort, REVE-2, for which standardization parameters were set up from the data of the derivation cohort, REVE-1, enabling to use them for other cohorts as we did with REVE-2. We confirmed the effect of groups on occurrence of HF characterized in REVE-1 using competing risk models in REVE-2, as in REVE-1. These two groups of patients only identified by their circulating levels of proteins also showed differences in the clinical characteristics allowing the stratification of the patients in two groups, low and high risk of adverse outcome following MI. These results may help clinicians for more targeted and personalized treatments for patients regarding their cardiac outcome.
The two identified groups of this study based on the measurement of 50 selected proteins with the same technology can easily be applied to other cohorts (or single patient) using the standardization parameters set up.

Translation of the identified proteins into biological pathways
The proteins found in our network analysis were translated into biological pathways typically related to HF. The network analysis showed that pathways specifically upregulated in MI patients with high risk of hospitalization for HF (group 2) were related to cell-to-cell communication and cardiovascular disease.
The key proteins are NT-proBNP, ADIPOQ, SERPINC1, and WNT3A. NT-proBNP is associated with cardiac stretch, with plasma levels widely used for screening and diagnosis of HF 13, and was previously found to be a specific hub in network analysis of patients with HF and reduced ejection fraction in two independent studies. 14,15 ADIPOQ has been shown to affect the autophagic response in the heart and to contribute to accelerated cardiac remodeling. 16 NT-proBNP was closely related in network 1 to PRKN, a protein involved in autophagy, and both were represented in most cardiovascular diseases (Table S7). SERPINC1 was also highly represented in cardiovascular diseases and by GO enrichment, but its potential as a biomarker in HF has not been described up to now. The same holds for WNT3A, which is found in the 2 selected clusters by GO analysis; it has been shown to be involved in cardiac muscle cell differentiation, and its upregulation has been shown to be involved in TGFβ1-induced cardiac hypertrophy 17 but also in cardiomyocyte injury following hypoxia. 18

The IPA analysis was performed on the 50 proteins selected to be associated with long-term survival and listed in Table 2 (version: 73,620,684 Release: 2022-03-12). Only network 1 (A) and network 2 (B) from Table 4 are presented. Nodes are displayed using various shapes that represent the functional class of the proteins as published on https://qiagen.secure.force.com/KnowledgeBase/KnowledgeIPAPage?id=kA41i000000L5rTCAS. The color of proteins indicates their regulated expression (red: increased and green: decreased) associated with occurrence of hospitalization for HF. The arrows indicate the modulatory effect of a protein on its interacting proteins. Only direct interactions were selected. Detailed information on the molecules present in the networks is given in Table 4.

Clinical implications and future perspectives
Individualized risk assessment is an integral part of management for patients in the post-MI setting. Early determination of the 50 selected proteins may help to detect patients at high risk of adverse outcome in the post-MI period. Such identification may encourage more aggressive therapy for this high-risk group. The identified groups were not impacted by the clinical differences between the two cohorts, showing the robustness of the two groups to clinical variations. Given this, the groups' ability to predict HF for patients from other cohorts should remain relevant. Finally, the pathophysiological mechanisms behind the selected proteins enriched in the networks that are highly represented in cardiovascular diseases remain to be established.

Limitations of the study
The mean sampling times of the two cohorts differ, but the results are still significant, which shows that the identified groups are robust to sampling time variations and could be used for clinical prognosis.
The choice of the silhouette criterion for the number of groups could be discussed, as many criteria exist in the literature. We believe that two groups are appropriate and lessen the risk of overfitting. Although we believe our strategy is valid, external validation in large cohorts from other areas or countries is mandatory to confirm the predictive value of the identified biomarker subset for classifying MI patients as at high or low risk of adverse outcome. Our clustering approach was validated in REVE-2, the validation cohort. The patients from the validation cohort were recruited in the same region as the derivation cohort, but standardization parameters were set up to allow external validation in other cohorts of patients, recruited anywhere, in which circulating proteins were measured by a similar technology.
Although most patients had acute reperfusion, the proportion of patients with primary PCI reflects the practice in 2002-2004 and 2006-2008 and was lower than it would be nowadays. Finally, our study populations consisted of mainly men (74 and 81%, respectively in REVE-1 and REVE-2), therefore our prediction model may be less suitable for women. In addition, the patients recruited in the two studies suffered from severe MI and the results might not apply to the overall population of patients after MI.

Conclusions
Here, we have investigated the cardiac prognostic implications of the largest panel of biomarkers available in MI patients, thanks to the aptamer-based platform. This study has several clinical implications. First, we improve the prediction of the long-term risk of HF occurrence in MI patients; second, the results obtained provide biological context for long-term adverse cardiac outcomes. The proteins found in our network analysis were linked to cardiovascular diseases.
We were able to validate our findings in an independent cohort, significantly reducing the overfitting effect by focusing on proteins linked to the outcome to ensure that results are specific to HF. These promising findings will require external validation in additional ethnic groups of patients.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:  iScience Article

Materials availability
This study did not generate new unique reagents.
There are restrictions to the availability of SOMAScan (property of Somalogic company) and human samples (French cohorts of patients associated with clinical information).

Data and code availability
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Data: All (clinical and proteomic) data reported in this paper will be shared by the lead contact upon request.
Code: This paper does not report original code.

Data deposition and materials sharing
The data and methods used in the analysis are available to any researcher for purposes of reproducing the results or replicating the procedures.

Study populations
The REVE studies have been previously reported. 5, 6 Approval for the SOMAscan analysis was obtained from the ''Comité de Protection des personnes Nord Ouest IV'' (February 2018). IRB approval was obtained and subjects gave informed consent. The inclusion criteria were the same: a first anterior Q-wave MI with ≥ 3 akinetic segments at predischarge echocardiography. Exclusion criteria were inadequate echocardiography image quality, life-limiting noncardiac disease, significant valvular disease, or previous Q-wave MI in both studies. A long-term clinical follow-up was performed by contacting the general practitioner or cardiologist, or the patients themselves. 7 We collected data on death and hospitalization for HF. All events occurring during follow-up were adjudicated by two investigators, with a third opinion in cases of disagreement. For hospitalizations during the follow-up period, hospital records were reviewed for evidence of clinical events. The events reported by the patients were systematically confirmed from the medical records. Hospitalization for HF was defined as hospitalization for symptoms of dyspnea or edema, elevated venous pressure, or interstitial or alveolar edema on chest X-ray, or the addition of intravenous diuretics or inotropic medications. The primary endpoint of the present study was hospitalization for HF and the competing event was death from all causes.

Plasma proteomics measurements
Plasma protein levels were measured in both cohorts with SOMAscan technology (version V4.0). [19][20][21] Peripheral blood samples were collected in EDTA-treated tubes for 255 patients in REVE-1 and 238 patients in REVE-2 after MI during the initial hospitalization. The mean blood sampling day was 7.4 ± 3.9 days and 4.1 ± 2.2 days after MI in REVE-1 and REVE-2, respectively. Blood samples were then assayed according to the manufacturer's protocol, as previously described with the 1.3k assay. 22 Protein levels from the SOMAscan assay are expressed as relative fluorescence units (RFU).

Standardization of proteomic measurement
There were no missing values in the set of the 5284 proteins quantified in both cohorts. A total of 414 SOMAmer measurements were removed from the analysis on request of the manufacturer. In the data set, 197 identical proteins were measured using two to three SOMAmers (Table S1). In order to give the same weight to each protein in the data set, for the proteins measured by several SOMAmers, the mean was calculated leading to 4668 proteins ready for analysis.
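The averaging step described above can be illustrated with a minimal Python sketch. The function name, the identifiers `seq_a`, `PROT_X`, and the toy values are hypothetical; the text does not state whether the mean was taken before or after the log2 transformation, so the sketch simply averages whatever values it receives.

```python
from collections import defaultdict
from statistics import mean


def collapse_somamers(measurements, somamer_to_protein):
    """Average the SOMAmer signals that target the same protein, per sample.

    measurements: {sample_id: {somamer_id: value}}
    somamer_to_protein: {somamer_id: protein_name}
    """
    collapsed = {}
    for sample_id, values in measurements.items():
        by_protein = defaultdict(list)
        for somamer_id, value in values.items():
            by_protein[somamer_to_protein[somamer_id]].append(value)
        # One value per protein, so every protein carries the same weight.
        collapsed[sample_id] = {p: mean(v) for p, v in by_protein.items()}
    return collapsed


# Hypothetical toy data: PROT_X is measured by two SOMAmers, PROT_Y by one.
measurements = {"patient_1": {"seq_a": 100.0, "seq_b": 140.0, "seq_c": 50.0}}
somamer_to_protein = {"seq_a": "PROT_X", "seq_b": "PROT_X", "seq_c": "PROT_Y"}
collapsed = collapse_somamers(measurements, somamer_to_protein)
```

With the counts reported above, 5284 − 414 = 4870 SOMAmer measurements remain after removal, and averaging the redundant ones yields the 4668 proteins used in the analysis.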
Proteomic expression levels were log2 transformed. Data quality was checked, and no variation was found between the 15 plates used for the measurements (Figure S1). Results of brain natriuretic peptide (BNP) measurements with the SOMAscan assay were also compared to BNP measurements performed with the automated 2-site sandwich immunoassay on an Advia Centaur (Siemens Diagnostic, Zurich, Switzerland) and showed high correlation (r = 0.992). Log2 transformed protein expression levels were standardized (centered and reduced) for each protein on REVE-1, the derivation cohort. These standardization parameters (mean and standard deviation) used in REVE-1 (Table S2) were then applied in REVE-2, the validation cohort.
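As an illustration of this standardization, the following Python sketch estimates the parameters on a derivation cohort and applies them to a new measurement. It is a sketch under assumptions, not the code used in the study: the function names and toy values are hypothetical, and the sample standard deviation (n − 1 denominator) is assumed because the text does not specify which estimator was used.

```python
import math
from statistics import mean, stdev


def fit_standardizer(derivation_rfu):
    """Estimate (mean, SD) of the log2 values of each protein.

    derivation_rfu: {protein: [RFU values in the derivation cohort]}
    """
    params = {}
    for protein, values in derivation_rfu.items():
        logs = [math.log2(v) for v in values]
        # Assumption: sample SD (n - 1 denominator).
        params[protein] = (mean(logs), stdev(logs))
    return params


def standardize(rfu_by_protein, params):
    """Center and reduce one patient's log2 values with derivation parameters."""
    out = {}
    for protein, value in rfu_by_protein.items():
        mu, sd = params[protein]
        out[protein] = (math.log2(value) - mu) / sd
    return out


# Hypothetical toy data: log2 values 1 and 3 in the derivation cohort.
params = fit_standardizer({"PROT_X": [2.0, 8.0]})
# A validation patient with RFU 4 (log2 = 2) sits exactly at the derivation mean.
z = standardize({"PROT_X": 4.0}, params)
```

Because the parameters come only from the derivation cohort, the same `params` can be reused unchanged for any later cohort or single patient.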

Imputation of missing clinical values
Of the 27 clinical variables studied, corresponding to 6858 and 6426 data points for all the patients from REVE-1 and REVE-2, respectively, 53 and 69 individual clinical values were missing in REVE-1 (0.77%) and REVE-2 (1.07%), respectively. These values were imputed using the single imputation method missMDA. 23 Figure 1 shows the strategy used for analyzing the data obtained in the derivation (REVE-1) and validation (REVE-2) cohorts.

Proteins selection
In order to only focus on the relevant proteins, with a significant effect on the occurrence of hospitalization for HF, univariate competing risk models were fitted for each protein individually using R package cmprsk (version 2.2-11). 24 For each protein, the occurrence of hospitalization for HF was modeled in competition with the occurrence of all-causes death using the competing risk model as previously defined. 25 Tests were performed on the subhazard ratio (SHR) to measure the significance of the relationship between the protein's expression and the occurrence of hospitalization for HF. Then, in order to take into account multiple testing, the significance threshold was set to 1.07 × 10⁻⁵, which corresponds to the FamilyWise Error Rate
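The multiple-testing threshold can be checked with a short Python sketch. The overall level of 0.05 is an assumption: it is not stated explicitly here, but 0.05 divided by the 4668 tested proteins reproduces the reported threshold. The protein names and p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni's method."""
    return alpha / n_tests


def select_proteins(p_values, threshold):
    """Return the names of the proteins whose p-value is below the threshold."""
    return sorted(name for name, p in p_values.items() if p < threshold)


# Assumed family-wise level of 0.05 over the 4668 univariate tests.
threshold = bonferroni_threshold(0.05, 4668)  # about 1.07e-5

# Hypothetical p-values from three univariate competing risk models.
p_values = {"PROT_A": 3.0e-7, "PROT_B": 2.0e-4, "PROT_C": 9.9e-6}
selected = select_proteins(p_values, threshold)
```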

Clustering of patients based on selected proteins
A k-means clustering algorithm 27 was applied in the REVE-1 study using the selected proteins as variables in order to define groups of patients. This algorithm builds groups of individuals sharing similar values for protein expressions regardless of the risk of hospitalization for HF of the patients. Individuals are gathered into groups characterized by their centers, and each patient is assigned to the group whose center is the closest in terms of Euclidean distance. Therefore, a new categorical variable called ''group variable'' was defined with these assignments for the following analyses.
In order to choose a suitable number of groups, the overall average silhouette width was computed. 28 The silhouette refers to a method of interpretation and validation of consistency within clusters of data. The average silhouette width is calculated with the Euclidean distance and compares, for each individual, the mean distance to the other individuals of the same group with the mean distance to the individuals of the nearest other group. The silhouette ranges from -1 to +1, where values closer to +1 indicate that the clustering is appropriate. This criterion was computed for a number of groups varying from 2 to 6. Finally, the number of groups that gave the highest overall average silhouette width was used.
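The silhouette criterion can be sketched in Python as follows. This is a minimal illustration for a given partition, not the implementation used in the study; the function names and toy points are hypothetical, and a patient alone in its group is given a width of 0 by convention. At least two groups are required.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_silhouette_width(points, labels):
    """Overall average silhouette width of a partition (needs >= 2 groups)."""
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each group, excluding p itself.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(euclidean(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:  # p is alone in its group
            widths.append(0.0)
            continue
        a = mean_dist[own]  # cohesion: own group
        b = min(d for c, d in mean_dist.items() if c != own)  # nearest other group
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette_width(points, [0, 0, 1, 1])  # natural partition
bad = average_silhouette_width(points, [0, 1, 0, 1])   # mixed partition
```

In practice, the same function would be evaluated on the k-means partitions obtained for each number of groups from 2 to 6, keeping the number with the highest value.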
As data were standardized using the same parameters for both cohorts, patients of REVE-2 were then assigned to the previously built groups in REVE-1 using the same method.
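The clustering and assignment steps can be sketched in Python with a basic Lloyd-style k-means. This is an illustration under assumptions, not the algorithm variant or software used in the study: the initial centers are passed explicitly to keep the example deterministic, and all names and values are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(point, centers):
    """Index of the center closest to the point in Euclidean distance."""
    return min(range(len(centers)), key=lambda k: euclidean(point, centers[k]))


def kmeans(points, initial_centers, max_iter=100):
    """Lloyd-style k-means: alternate assignment and center update."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest_center(p, centers) for p in points]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the previous center if a group becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical standardized profiles (2 proteins) in the derivation cohort.
derivation = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.1), (2.9, 3.0)]
centers, labels = kmeans(derivation, [derivation[0], derivation[2]])

# Validation patients are not re-clustered: each one is assigned to the
# nearest center learned on the derivation cohort.
validation = [(0.1, 0.3), (2.5, 2.8)]
validation_labels = [nearest_center(p, centers) for p in validation]
```

Assigning validation patients to fixed centers is only meaningful because both cohorts were standardized with the same derivation parameters.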
Group differences were identified using Welch's test of mean equality for quantitative variables (both for clinical data and for differential proteomic analysis) or chi squared test for clinical categorical variables. Differences corresponding to raw p-values below 0.05 were considered significant.
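For the quantitative comparisons, Welch's statistic and its approximate degrees of freedom can be sketched in Python as follows. The function name and toy values are hypothetical; the p-value itself requires the Student t distribution, which is not part of the Python standard library, so it is left out of this sketch.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx = variance(x) / nx  # squared standard error of the mean of x
    vy = variance(y) / ny  # squared standard error of the mean of y
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Hypothetical expression values of one protein in the two groups.
group_1 = [1.0, 2.0, 3.0, 4.0]
group_2 = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_1, group_2)
```

Unlike the classical Student test, this statistic does not assume equal variances in the two groups, which suits groups of unequal size such as those reported here.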

Groups' effect on the occurrence of HF
In order to assess the interest of the previously created groups, cumulative incidence curves of hospitalization for HF were drawn for both groups. For each cohort, competing risk models were then used to model the occurrence of hospitalization for HF using the group variable. This makes it possible to measure the strength and to test the significance of the group effect. Models were fitted for both cohorts in order to measure the effect of the groups in the derivation and the validation cohort. The models were then adjusted for the clinical variables age, gender, ejection fraction, diabetes, Killip class, serum creatinine, BNP and NT-proBNP separately, in order to ensure the robustness of the information provided by the identified groups after adding relevant clinical information.
Sensitivity analysis was performed on the selected proteins without the established cardiac biomarkers (BNP, NT-proBNP). Incidence curves were drawn for the groups identified with the selected proteins minus the established biomarkers, and a competing risk model was fitted to estimate the SHR.
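A cumulative incidence curve in the presence of a competing event can be sketched in Python with a standard nonparametric estimator of Aalen-Johansen type. The estimator used for the figures is not specified in the text, so this choice is an assumption; the event coding and toy data are hypothetical.

```python
def cumulative_incidence(times, events, cause=1):
    """Nonparametric cumulative incidence of one cause with competing risks.

    times: follow-up time of each patient
    events: 0 = censored, 1 = hospitalization for HF, 2 = death (assumed coding)
    Returns a list of (time, cumulative incidence) pairs.
    """
    surv = 1.0  # probability of being free of any event just before t
    cif = 0.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        d_cause = sum(1 for x, e in zip(times, events) if x == t and e == cause)
        d_any = sum(1 for x, e in zip(times, events) if x == t and e != 0)
        # Increment: event-free probability times cause-specific hazard at t.
        cif += surv * d_cause / at_risk
        surv *= 1 - d_any / at_risk
        curve.append((t, cif))
    return curve


# Hypothetical follow-up of five patients (years) and their event types.
times = [1, 2, 3, 4, 5]
events = [1, 2, 0, 1, 0]
curve = cumulative_incidence(times, events, cause=1)
```

Treating death as censoring in a Kaplan-Meier curve would overestimate the incidence of HF, which is why death from all causes is handled as a competing event.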
All statistical analyses were made using R (version 4.0.2).

Enrichment analysis
Gene ontology enrichment analysis was performed on the set of selected proteins using Biological Processes, Molecular Function and Cellular Component subsets. Enrichments tests were performed using the R package clusterProfiler. 29 For each GO term, a Fisher exact test was performed in order to test for the overrepresentation of the set of selected proteins in the GO term.
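The over-representation test for one GO term can be sketched in Python as a one-sided Fisher exact test, which is equivalent to the upper tail of a hypergeometric distribution. This is an illustration of the principle and not a call to clusterProfiler; the function name and toy counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_term, n_selected, n_overlap):
    """One-sided over-representation p-value for one GO term.

    n_universe: number of proteins in the background set
    n_term: background proteins annotated to the GO term
    n_selected: number of selected proteins
    n_overlap: selected proteins annotated to the GO term
    Returns P(X >= n_overlap) for X hypergeometric.
    """
    total = comb(n_universe, n_selected)
    upper = min(n_term, n_selected)
    favorable = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return favorable / total


# Hypothetical toy counts: 10 background proteins, 4 in the term,
# 3 selected proteins, 2 of which are annotated to the term.
p = enrichment_p_value(10, 4, 3, 2)
```

The result depends on the background set, so the choice of universe (for example, all measured proteins rather than the whole proteome) should be reported alongside the enrichment results.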

Ingenuity Pathways Analysis
The proteins of interest, selected to be significantly associated with occurrence of HF, were subjected to Ingenuity Pathways Analysis (IPA) (Current version: 73620684 Release: 2022-03-12, Ingenuity Systems) and used as a starting point for building biological networks. This analysis uses computational algorithms to identify networks consisting of focus proteins (proteins significantly modulated) and their interactions with other proteins (''non-focused'') in the knowledge base. Scores were calculated for each network according to the fit of the network to the set of focus proteins and used to rank networks on the IPA database restricted to direct interactions. IPA uses the proteins from the highest-scoring network to extract a
Hybridization Control Normalization was developed to remove systematic biases present in the raw data after slide feature aggregation from a slide-based hybridization microarray for assay readout and quantification. Hybridization Control Normalization is performed using a set of twelve hybridization control sequences measured independently for each sample array. The procedure is intended to correct for systematic effects on the data introduced during the hybridization readout and results in a single scale factor for each sample that is subsequently applied to the measured signal on all features within a subarray (sample).
Intraplate Median Signal Normalization uses all the SOMAmer reagent signals on a given subarray to remove sample or assay biases that may be due to differences between samples in overall protein concentration, pipetting variation, variation in reagent concentrations, assay timing, and any other source of systematic variability within a single plate. Each SOMAmer reagent is assigned to one of three dilution sets, scale factors are derived within dilution sets separately, and all SOMAmer binding reagents within each set are scaled together. Three sample dilutions will result in three independent median signal scale factors for each subarray (sample) in addition to the hybridization scale factor. This step is only applied to calibrator samples.
Plate Scaling and Calibration is accomplished using a number of replicate measurements of a common pooled calibrator sample consistent with the assay sample type for a study. Calibrator samples must be composed of the same sample matrix as the samples that are being calibrated. No protein spikes are added to the calibrator samples; SomaLogic relies solely on the endogenous levels of each analyte within a calibrator sample. Since calibration attempts to correct plate-to-plate variation and such variation can be idiosyncratic for SOMAmer binding reagents, a unique calibration scale factor is derived for each SOMAmer binding reagent within the assay. The median of these scale factors is then computed and applied across all SOMAmer measurements in that plate to account for the total signal difference (plate scale), and the scale factors are subsequently recalculated for each SOMAmer and applied to all measurements within the set of samples in that plate.
Median Signal Normalization to a Reference occurs on a per-sample basis, wherein a scale factor for a set of SOMAmer reagents is computed against a reference value generated from a cohort of healthy normal individuals and then aggregated within a dilution. The median of each dilution's scale factors is then applied to their respective SOMAmer reagents. This step is applied to QC, Buffer, and individual samples.
concentrations to give measured relative fluorescence units (RFU) that span the dynamic range of the assay. The global reference RFU value for each hybridization control is defined by the median signal measured within the current plate being normalized. A ratio is computed by dividing the median RFU for each control by its measured RFU in the sample. The median of these hybridization control measurement ratios in each subarray defines the sample-based hybridization scale factor. By definition, such a scaling will equate the median RFU for the hybridization controls to the median reference RFU for the controls. All SOMAmer reagent results within a sample are multiplied by the resulting hybridization scale factor, increasing or decreasing the overall "brightness" of the sample. The procedure is displayed graphically in Note Figure S4.
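The hybridization scale factor described above can be illustrated with a short Python sketch. It is not the vendor's implementation; the function name and the nested-dictionary data layout are assumptions made for the example.

```python
from statistics import median


def hybridization_scale_factors(controls_by_sample):
    """Compute one hybridization scale factor per sample.

    controls_by_sample: {sample_id: {control_id: rfu}} for one plate
    (assumed layout; every sample carries the same control IDs).
    """
    control_ids = list(next(iter(controls_by_sample.values())))
    # Plate-level reference: median RFU of each control across the plate.
    reference = {
        c: median(sample[c] for sample in controls_by_sample.values())
        for c in control_ids
    }
    # Per-sample factor: median of reference / measured over the controls.
    return {
        sample_id: median(reference[c] / rfu[c] for c in control_ids)
        for sample_id, rfu in controls_by_sample.items()
    }


# Hypothetical plate with two controls and three samples.
plate = {
    "s1": {"h1": 100.0, "h2": 200.0},
    "s2": {"h1": 200.0, "h2": 400.0},
    "s3": {"h1": 50.0, "h2": 100.0},
}
factors = hybridization_scale_factors(plate)
print(factors)  # {'s1': 1.0, 's2': 0.5, 's3': 2.0}
```

Every SOMAmer reagent signal in a sample would then be multiplied by that sample's factor.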
Intraplate Median Signal Normalization. Intraplate Median Signal Normalization is performed on each sample dilution independently. In most matrices, each SOMAmer binding reagent is assigned to one of three dilution sets, scale factors are derived within dilution sets separately, and all SOMAmer reagents within each set are scaled together. Within each sample matrix, this is only performed on calibrator samples. Like Hybridization Control Normalization, which uses a local reference standard, the local median reference RFU for each SOMAmer reagent is the median RFU for that SOMAmer binding reagent within the sample group (calibrator in buffer) in the plate to be normalized. As in hybridization normalization, a ratio is computed for each SOMAmer reagent by dividing the reference SOMAmer RFU by its measured RFU in the sample to be normalized. The median of the SOMAmer measurement ratios for all SOMAmer reagents in a dilution defines the sample-based scale factor for all SOMAmer reagents within that dilution and sample. All SOMAmer reagents within the dilution for a sample are scaled by the resulting median signal scale factor. Three sample dilutions will result in three independent median signal scale factors for each sample in addition to the hybridization scale factor as shown in Note Figure S5.
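As a rough illustration of the per-dilution median signal normalization described above, the following Python sketch derives one scale factor per dilution set and applies it to a single sample. The function name, the dictionary layout, and the dilution labels are assumptions, not part of the vendor's procedure.

```python
from statistics import median


def median_signal_normalize(sample_rfu, reference_rfu, dilution_of):
    """Normalize one sample, dilution set by dilution set.

    sample_rfu:    {reagent: measured RFU in the sample}
    reference_rfu: {reagent: reference RFU (e.g. plate median)}
    dilution_of:   {reagent: dilution set label}
    Returns (scale factor per dilution, normalized RFU per reagent).
    """
    ratios = {}
    for reagent, rfu in sample_rfu.items():
        # Ratio of reference to measured signal, grouped by dilution set.
        ratios.setdefault(dilution_of[reagent], []).append(
            reference_rfu[reagent] / rfu
        )
    scale = {dilution: median(values) for dilution, values in ratios.items()}
    normalized = {
        reagent: rfu * scale[dilution_of[reagent]]
        for reagent, rfu in sample_rfu.items()
    }
    return scale, normalized


# Hypothetical sample with three reagents in two dilution sets.
scale, normalized = median_signal_normalize(
    sample_rfu={"a": 50.0, "b": 100.0, "c": 400.0},
    reference_rfu={"a": 100.0, "b": 200.0, "c": 200.0},
    dilution_of={"a": "d1", "b": "d1", "c": "d2"},
)
print(scale)  # {'d1': 2.0, 'd2': 0.5}
```

The same function would apply to normalization against an external reference by passing that reference's values as `reference_rfu`.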

Plate scaling & calibration
Clinical sample studies are plate scaled and calibrated to remove systematic assay variability. A set of control calibrator samples is used to detect and remove systematic variability between independent assay plates. Calibrator samples must be of the same type as the samples that are being calibrated. Calibrator global reference RFU values for each SOMAmer reagent are defined by the median signal measured on a set of samples spanning a number of independent assay plates that have been shown to meet assay acceptance criteria. For each SOMAmer reagent, the median RFU signal for that SOMAmer reagent across all the calibrator samples within the clinical study defines the global calibrator reference for that SOMAmer binding reagent. Note Figure S6 below displays the data from a typical clinical study and illustrates the systematic bias removed by calibration.
Plate scaling is performed on an entire independent plate. A local median reference value is derived for each SOMAmer reagent by computing the median RFU for that SOMAmer reagent from the set of replicate calibrator samples within the plate. The SOMAmer-based calibration scale factor is then computed by dividing the calibrator global reference RFU by the local median reference value defined for each SOMAmer reagent. The median of all scale factors for a given plate is then applied across all SOMAmer measurements in the plate (plate scale), forcing the overall calibrator median signal to match the overall median signal within the global calibrator reference.
Plate-to-plate calibration is performed on each SOMAmer measurement within the plate independently. A local median reference value is derived for each SOMAmer reagent by computing the median RFU for that SOMAmer reagent from the set of replicate calibrator samples within the plate. The SOMAmer-based calibration scale factor is then computed by dividing the calibrator global reference RFU by the local median reference value defined for each SOMAmer reagent. This scale factor is applied to all SOMAmer measurements in the plate, forcing the median calibrator signal to match the global calibrator reference for that SOMAmer binding reagent. Each plate within a study has a unique calibration scale factor for each SOMAmer reagent. The data from Note Figure S5 are displayed after calibration in Note Figure S6.
in practice, the primary difference being the origination of the reference value. A ratio is computed for each SOMAmer reagent by dividing the global reference SOMAmer RFU by its measured RFU in the sample to be normalized. The median of the SOMAmer measurement ratios for all SOMAmer reagents in a dilution defines the sample-based scale factor for all SOMAmer reagents within that dilution and sample. All SOMAmer reagents within the dilution for a sample are scaled by the resulting median signal scale factor. Three sample dilutions will result in three independent median signal scale factors for each sample in addition to the hybridization scale factor.
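The plate scaling and plate-to-plate calibration steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name and data layout are hypothetical, and the per-reagent factors are recalculated after plate scaling, which is one reading of the procedure.

```python
from statistics import median


def plate_scale_and_calibrate(calibrator_replicates, global_reference):
    """Derive the plate scale and per-reagent calibration factors for a plate.

    calibrator_replicates: {reagent: [RFU of each calibrator replicate]}
    global_reference:      {reagent: global calibrator reference RFU}
    """
    # Local reference: median of the replicate calibrators on this plate.
    local = {r: median(values) for r, values in calibrator_replicates.items()}
    # Per-reagent ratio of global to local reference.
    raw = {r: global_reference[r] / local[r] for r in local}
    # Plate scale: a single factor, the median of the per-reagent ratios.
    plate_scale = median(raw.values())
    # Calibration factors recalculated on the plate-scaled signal (assumption).
    calibration = {
        r: global_reference[r] / (local[r] * plate_scale) for r in local
    }
    return plate_scale, calibration


# Hypothetical plate with three reagents and three calibrator replicates.
plate_scale, calibration = plate_scale_and_calibrate(
    calibrator_replicates={
        "a": [90.0, 100.0, 110.0],
        "b": [380.0, 400.0, 420.0],
        "c": [50.0, 50.0, 50.0],
    },
    global_reference={"a": 200.0, "b": 800.0, "c": 150.0},
)
print(plate_scale, calibration)  # 2.0 {'a': 1.0, 'b': 1.0, 'c': 1.5}
```

Multiplying a measurement by `plate_scale` and then by its reagent's calibration factor brings the plate's median calibrator signal onto the global reference for that reagent.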

Acceptance criteria
Hybridization Control and Intraplate Median Signal Normalization scale factors are expected to be in the range of 0.4-2.5. The plate scale factor is expected to be between 0.4 and 2.5. The distribution of QC sample ratios is expected to have 85% of individual SOMAmer reagents in the total array between 0.84 and 1.19 (i.e. less than 15% in the tails of the distribution). Gaussian distributions of scale factors are expected. A report is provided for each study (single plate or set of plates) with the results of the Normalization and Calibration process.
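The numeric acceptance criteria above lend themselves to a simple automated check. The Python sketch below is illustrative only: the function names are hypothetical, and treating the range boundaries as inclusive is an assumption.

```python
def scale_factor_ok(factor, low=0.4, high=2.5):
    """True if a normalization or plate scale factor lies in the expected range."""
    return low <= factor <= high  # inclusive bounds are an assumption


def qc_ratios_ok(ratios, low=0.84, high=1.19, min_fraction=0.85):
    """True if at least min_fraction of QC sample ratios lie within [low, high]."""
    inside = sum(1 for ratio in ratios if low <= ratio <= high)
    return inside / len(ratios) >= min_fraction


# Hypothetical values: 18 of 20 QC ratios fall inside the accepted window.
qc = [1.0] * 18 + [0.5, 1.6]
print(scale_factor_ok(1.3), scale_factor_ok(3.0), qc_ratios_ok(qc))  # True False True
```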
